93 research outputs found

Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing

Authors: Andoni Alexandr, Razenshteyn Ilya
Publication date: 01/01/2015

We prove a tight lower bound for the exponent $\rho$ for data-dependent Locality-Sensitive Hashing schemes, recently used to design efficient solutions for the $c$-approximate nearest neighbor search. In particular, our lower bound matches the bound of $\rho\le \frac{1}{2c-1}+o(1)$ for the $\ell_1$ space, obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15]. In recent years it emerged that data-dependent hashing is strictly superior to the classical Locality-Sensitive Hashing, when the hash function is data-independent. In the latter setting, the best exponent has been already known: for the $\ell_1$ space, the tight bound is $\rho=1/c$, with the upper bound from [Indyk-Motwani, STOC'98] and the matching lower bound from [O'Donnell-Wu-Zhou, ITCS'11]. We prove that, even if the hashing is data-dependent, it must hold that $\rho\ge \frac{1}{2c-1}-o(1)$. To prove the result, we need to formalize the exact notion of data-dependent hashing that also captures the complexity of the hash functions (in addition to their collision properties). Without restricting such complexity, we would allow for obviously infeasible solutions such as the Voronoi diagram of a dataset. To preclude such solutions, we require our hash functions to be succinct. This condition is satisfied by all the known algorithmic results.

Comment: 16 pages, no figures
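To make the gap between the two exponents concrete, here is a minimal Python sketch that evaluates the data-independent bound $\rho = 1/c$ and the data-dependent bound $\rho = 1/(2c-1)$ quoted in the abstract, with the $o(1)$ terms dropped. The function names and the sample values of $c$ are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction


def rho_data_independent(c):
    """Exponent 1/c quoted in the abstract for data-independent LSH in l_1."""
    return 1 / Fraction(c)


def rho_data_dependent(c):
    """Exponent 1/(2c-1) quoted in the abstract for data-dependent LSH.

    The o(1) terms are ignored; c is assumed to be at least 1.
    """
    return 1 / (2 * Fraction(c) - 1)


if __name__ == "__main__":
    # Sample approximation factors chosen only for illustration.
    for c in (Fraction(3, 2), 2, 3, 4):
        print(f"c={c}: 1/c={rho_data_independent(c)}, "
              f"1/(2c-1)={rho_data_dependent(c)}")
```

For $c = 2$, for instance, the two exponents are $1/2$ and $1/3$, and the data-dependent exponent is strictly smaller for every $c > 1$.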

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

Global Alignment of Molecular Sequences via Ancestral State Reconstruction

Authors: Andoni Alexandr, Daskalakis Constantinos, Hassidim Avinatan, Roch Sebastien
Publication date: 01/01/2009

Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree-building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice, these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we consider a new version of the multiple sequence alignment in the context of stochastic indel models. More precisely, we introduce the following {\em trace reconstruction problem on a tree} (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Lower Bounds on Time-Space Trade-Offs for Approximate Near Neighbors

Authors: Andoni Alexandr, Laarhoven Thijs, Razenshteyn Ilya, Waingarten Erik
Publication date: 01/01/2016

We show tight lower bounds for the entire trade-off between space and query time for the Approximate Near Neighbor search problem. Our lower bounds hold in a restricted model of computation, which captures all hashing-based approaches. In particular, our lower bound matches the upper bound recently shown in [Laarhoven 2015] for the random instance on a Euclidean sphere (which we show in fact extends to the entire space $\mathbb{R}^d$ using the techniques from [Andoni, Razenshteyn 2015]). We also show tight, unconditional cell-probe lower bounds for one and two probes, improving upon the best known bounds from [Panigrahy, Talwar, Wieder 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than for one probe. To show the result for two probes, we establish and exploit a connection to locally-decodable codes.

Comment: 47 pages, 2 figures; v2: substantially revised introduction, lots of small corrections; subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1511.07527 [cs.DS])

arXiv.org e-Print Archive

Repository TU/e

Pure OAI Repository

Spectral Approaches to Nearest Neighbor Search

Authors: Abdullah Amirali, Andoni Alexandr, Kannan Ravindran, Krauthgamer Robert
Publication date: 04/08/2014

We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset $P$ in $\mathbb{R}^d$ is chosen arbitrarily from an unknown subspace of low dimension $k\ll d$, and then perturbed by fully $d$-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on $d$ and $\log n$ (where $n=|P|$) for large ranges of $k$, $d$ and $n$. Our algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is {\em much larger} than the interpoint distances in $P$. Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst case datasets. In this paper we aim to provide theoretical justification for this disparity.

Comment: Accepted in the proceedings of FOCS 2014. 30 pages and 4 figures

arXiv.org e-Print Archive

CiteSeerX

Parallel Algorithms for Geometric Graph Problems

Authors: Andoni Alexandr, Nikolov Aleksandar, Onak Krzysztof, Yaroslavtsev Grigory
Publication date: 01/01/2014

We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a $(1+\epsilon)$-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, $n^{1+o_\epsilon(1)}$. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for $(1+\epsilon)$-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a $(1+\epsilon)$-approximation algorithm with $n^{\delta}$ space in the streaming-with-sorting model with $1/\delta^{O(1)}$ passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem.
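For reference, the quantity that the abstract's algorithm approximates within a $(1+\epsilon)$ factor is the weight of a Euclidean minimum spanning tree on planar points. The sketch below computes it exactly with the textbook sequential Prim's algorithm in $O(n^2)$ time. It is only a baseline for comparison, not the constant-round parallel algorithm from the paper, and the function name `euclidean_mst_weight` is a hypothetical one chosen for illustration.

```python
import math


def euclidean_mst_weight(points):
    """Exact MST weight of 2-D points via O(n^2) Prim's algorithm.

    Sequential textbook baseline; not the parallel algorithm in the abstract.
    """
    n = len(points)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each point to the tree
    best[0] = 0.0          # start the tree at the first point
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        ux, uy = points[u]
        # Update connection costs of the remaining points through u.
        for v in range(n):
            if not in_tree[v]:
                vx, vy = points[v]
                d = math.hypot(ux - vx, uy - vy)
                if d < best[v]:
                    best[v] = d
    return total


if __name__ == "__main__":
    unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(euclidean_mst_weight(unit_square))
```

On the four corners of a unit square, any spanning tree made of three unit-length sides is optimal, so the MST weight is 3.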

arXiv.org e-Print Archive

CiteSeerX

Edit Distance in Near-Linear Time: it's a Constant Factor

Authors: Andoni Alexandr, Nosatzki Negev Shekel
Publication date: 15/05/2020

We present an algorithm for approximating the edit distance between two strings of length $n$ in time $n^{1+\epsilon}$, for any $\epsilon>0$, up to a constant factor. Our result completes the research direction set forth in the recent breakthrough paper [Chakraborty-Das-Goldenberg-Koucky-Saks, FOCS'18], which showed the first constant-factor approximation algorithm with a (strongly) sub-quadratic running time. Several recent results have shown near-linear complexity under different restrictions on the inputs (e.g., when the edit distance is close to maximal, or when one of the inputs is pseudo-random). In contrast, our algorithm obtains a constant-factor approximation in near-linear running time for any input strings.
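As a point of comparison for the running times discussed in the abstract, the classical exact computation of edit distance is a dynamic program that takes time proportional to the product of the two string lengths. The following minimal Python sketch shows that quadratic baseline. It is the standard textbook method, not the near-linear approximation algorithm from the paper, and the function name `edit_distance` is chosen here only for illustration.

```python
def edit_distance(a, b):
    """Exact edit (Levenshtein) distance via the classical quadratic DP.

    Keeps only two rows of the table, so memory is O(len(b)).
    """
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

For two strings of length $n$ this performs about $n^2$ cell updates, which is the cost that a running time of $n^{1+\epsilon}$ improves on at the price of a constant-factor approximation.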

arXiv.org e-Print Archive